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Abstract 

A number of problems in a variety of fields are characterised by target dis- 
tributions with a multimodal structure in which the presence of several isolated 
local maxima dramatically reduces the efficiency of Markov chain Monte Carlo 
sampling algorithms. Several solutions, such as simulated tempering or the use 
of parallel chains, have been proposed to facilitate the exploration of the relevant 
parameter space. They provide effective strategies in the cases in which the di- 
mension of the parameter space is small and/or the computational costs are not 
a limiting factor. These approaches fail however in the case of high-dimensional 
spaces where the multimodal structure is induced by degeneracies between re- 
gions of the parameter space. In this paper we present a fully Markovian way 
to efficiently sample this kind of distribution based on the general Delayed Re- 
jection scheme with an arbitrary number of steps, and provide details for an 
efficient numerical implementation of the algorithm. 
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1. Introduction 

Bayesian inference has been gaining considerable momentum in recent years 
with the introduction of a number of numerical techniques to compute (marginalised) 
posterior density functions and marginalised likelihood (or evidence) that are 
at the heart of the implementat ion challenge of this approach. Markov chain 
Mont e Carlo (MCMC) methods (jHastingj . [l970t ICilks e^oH . Il995l ICamermanl 



[1993) are in particular popular and have shown the potential to tackle suc- 
cessfully these problems in a wide variety of fields. These methods become 
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increasingly computationally expensive as the dimensionality and complexity 
of the target distribution grows, and much effort is being devoted both at the 
theoretical and implementation level to improve the efficiency and robustness 
of these approaches. 

Metropolis-Hastings algorithms are employed in the overwhelming majority 
of the practical implementations of MCMC methods. In Metropolis-Hastings 
updates, a new state (or value of the parameters describing the problem at hand) 
is drawn from a proposal distribution and then, either accepted or rejected, 
depending on the value of the target distribution, with probability given by 
Equation ([T|). If the proposed state is rejected, the chain remains in the current 
state and the previous steps are repeated. Reducing the number of rejected 
proposals while still exploring the structure of the target distribution represents 
one of the most important goals in every MCMC applic ation. By doing this, 
one improves the MCMC algorithm in the lPeskunI (|l973h sense. 

In a variety of practical applications, an additional complication is encoun- 
tered, which is that the target distribution presents a multimodal structure, 
often with secondary maxima well separated from the mode (see Figure [T|) . In 
such situations, maintaining a high acceptance ratio while sampling the full 
structure of the target distribution becomes very difficult, leading to a very 
small efficiency of the algorithm. Several solutions have been proposed to in- 
crease the mixing and exploration ability of the chains, such as using parallel 
chains exploring the parameter space that, at a certain point, can be swapped 
(e.g. Metropolis-coupled MCT ylC algorithms, se e Geveii (|l99ll )) or other meth- 



ods like simulated annealing (IKirkpatrick et ali. 119831) or simulated tempering 
( Marinari and Parisi . 1992; Geyer and Thompsonl " 19951 ) . that consist in adding 
a 'temperature' factor to the target distribution in order to make it smoother at 
the beginning and therefore to promote the movement of the chain. Although 
these technique s work fairly well in low dimensional problems, it has been shown 
( Cornishl . |2008[ ) that they are not the best solution in cases where the dimen- 
sionality of the parameter space is large and the target distribution presents a 
multimodal structure in several dimensions at the same time, i.e. the parameter 
space shows very localised 'islands' in which the function reaches high values, 
as it can be seen in Figure [TJ In fact, evolving several parallel chains is com- 
putationally inefficient when the multimodal distribution only appears in some 
of the parameters. Simulated tempering techniques can help to improve the 
mixing of Markov chains, but in a high number of dimension this can hinder 
convergence, as shown bv ICorni"shl ( 2008 ). 

A problem that exhibits such multimodal dist ributions is the ana lysis of 
data from the Laser Interferometer Space Antenna ( Bender et al . 19981 ). which 
provides the motivation for our work. Figure [T] shows slices through an 8- 
dimensional parameter space, in the plane of the three parameters most respon- 
sible for the complicated structure of the target distribution. In order to sample 
this function, we desire an algorithm which is both fully Markovian, and efficient 
in exploring the parameter space. Further details of the specific application are 
given in Section [5]. 

Here we present a fully Markovian method to efficiently explore the parame- 
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Figure 1: Example of a multimodal target distribution where a normal MCMC algorithm 
becomes very inefficient. These figures show the colour plots of the likelihood function from 
a simulated gravitational wave signal from galactic white-dwarf binary systems observed with 
the Laser Interferometer Space Antenna taking "slices" through the parameter space in the 
three parameters that are most responsible for the structure of the likelihood function: (left) 
frequency-colatitude, ('centerj frequency— longitude and ('rig/i4j colatitude— longitude. The sig- 
nal's frequency is given in un its of the characteristic modulation frequency for this problem 
(/m = 1 yr~^) lICutleH . Il998ll and the two sky location angles are referenced to the ecliptic 
plane. The dashed-light lines represent the actual value of the parameters. 



ter space of this kind of problems, assuming that we have some prior knowledge 
about the typical scales of the structures of the target distribution in the rele- 
vant parameters. In general, a normal MCMC algorithm will be able to widely 
explore the parameter space until 'finding' one of the modes of the target distri- 
bution that will be sampled properly. The inefficiency comes from the fact that 
having a structure of the target distribution made up by different isolated modes 
('islands') in several dimensions, it becomes very unlikely to 'jump' between the 
different modes with normal proposals. By having a rough idea of the typical 
distances in the parameter space between two local maxima, one can attempt a 
big jump in the parameter space trying to fall in the neighbourhood of another 
mode (probably far away from the maximum), followed by multiple small steps 
exploring the new region of the parameter space (see Figure [3]). If, during this 
exploration the chain finds a higher maximum, the chain would continue evolu- 
tion from the new mode, but if not it would remain in the original mode. To 
implement this idea it is necessary to find some algorithm that allows delaying 
the rejection, but preserves the Markovian properties of the resulting chain. 

A powerful method to overcome the rejection of proposed states and there- 
fore to in crease the efficiency of Metropolis-Hastings MCMC methods was pro- 
posed bv iTiernev and Mirat (fl99 9l : Mira (2001) for fixed dimension problems 
and then extended bv lCreen and Miral (|200lt ) to transdimensional problems, and 
goes under the name of Delayed Rejection (DR). In short, with this method, 
when a new state is proposed and rejected, one can actually make a second pro- 
posal, in principle based on a different distribution which may depend on the 
rejected value, and make a decision (accept or reject) using a suitable probabil- 
ity. If the second value is rejected, one can then consider a third attempt and 
so on. Through this approach, one can achieve efficiency gains, while preserving 
the Markovian properties of the chain. In the recent years the Delayed Re- 
jection method has been successfully applied to a number of problems of very 
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diflFerent ar eas, su c h as particle filters ( Robert and Mengerseij. 
ity models ( Raggi . 20051). gravitational wav e searches ( Umstatter et al. 



2003). volatil- 



2004 



or object recognition ( Harkness and Greenl . .2000) and also it has been used in 
combination with othe r techniques to creat e high efficient methods, e.g. by 



Al-Awadhi et al. (2004); Haario et al. ( 2006^ . In all these applications of the 
DR algorithm, either the number of steps was very small (2-3) or the assump- 
tion of symmetrical proposals was made, simplifying the general expressi on for 
the ac ceptance probability of the DR chain from Eq. ^ here to Eq. (8) of lMira 
(|200ll ). The problem we are dealing with in this paper requires the application 
of the full general scheme of the DR algorithm with an arbitrary number of 
stages. Since this is the first time to our knowledge that this has been done, in 
Section [4] we explain all the non-trivial implementation issues. 

Moreover, MCMC methods in general and the DR approach in particular, 
do not provide a well defined algorithm but just the general scheme to tackle the 
problem. In general, actual algorithms that are efficient and robust are prob- 
lem specific and require a number of complex decisions, in particular regarding 
choices of proposal distributions. As the number of dimensions grows and the 
structure of the target distribution increases, these decisions are highly non- 
trivial. In this paper we provide a detailed analysis of the Delayed Rejection 
approach and provide a set of simple and general rules for the implementa- 
tion of this scheme in order to efficiently explore complicated multimodal target 
distributions. 

The paper is organised as follows: in Section [2] we review the basic concepts 
behind Delayed Rejection and demonstrate the advantage of this algorithm over 
a standard Markov chain exploration by comparing the variances of estimates 
made from the chains; in Section [3] we provide general key rules for the imple- 
mentation of the algorithm and useful results for the selection of the proposal 
distributions, and in particular. Section 13.31 summarises the actual procedure 
for the choice of parameters of the algorithm. In Section |4] we focus on the 
implementation issues related to the application of the general DR algorithm 
with an arbitrary number of stages; Section [5] provides an application of the al- 
gorithm to the case of an 8-dimension parameter problem related to the search 
for gravitational waves from white dwarf binary systems in (mock) data of the 
Laser Interferometer Space Antenna and Section [6] contains our conclusions and 
pointers to future work. Appendix [Xl provides some details about the derivation 
of the analytical results presented in the main body of the paper. 



2. The Delayed Rejection approach 

The Metropolis-Hastings algorithm is widely used in applications of Markov 
chain Monte Carlo methods to sample a target distribution Tr{x). Given a state 
of the Markov chain A, one proposes a new state A' drawn from a proposal 
distribution (or transition kernel) q{X,X'), the probability of A' given A. This 
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new state is accepted with probabihty 



a(A,A') = 1 A 



7T{y)q{X',X) 
TT{\)q{X,X') 



(1) 



and the chain remains at A with probabihty 1 — a(A, A'). In the previous equa- 
tion, the notation a Ab (for any real numbers a and b) stands for the minimum 
between a and b, and as a consequence the acceptance probability is limited to 
unity in the case where the ratio is > 1. The convergence to the target distri- 
bution TT is guaranteed through the acceptance probability, but the convergence 
rate or efficiency of the chain is highly dependent on the choice of the transition 
kernel q. Its choice is one of the most critical stages in actually implementing 
a Markov chain Monte Carlo algorithm. An effective way to improve the effi- 
ciency of the algor i thm has been prop os ed in the form of Delay ed Rejection by 
iTiernev and Miral (|l999l ): iMiral (|200l[) : ICreen and Miral (|2001f) which we now 
briefly summarise mainly to establish notation for the case of a problem with a 
fixed number of dimensions. 



2.1. Delayed Rejection 

Suppose the current state of the chain is A. A candidate, /3i, is generated 
from a proposal distribution qi{X, (3i) and accepted with probability 



ai(A,/3i) = lA 



7r(/?i)gi(/3i,A) 
7r(A)(Zi(A,A) 



(2) 



as in a standard Metropolis-Hastings algorithm, see Equation ([I]). If /3i is not 
accepted, instead of rejecting it, one can use this information and propose a new 
state /32 from a (in principle different) proposal distribution <72(A, /32) which 
may use information about the rejected state Pi. In order to maintain the same 
stationary distribution, the acceptance probability of this new candidate may 
be computed as 



a2(A,/3i,/32) = 1 A 



<p2)qi02jl)q2iP2jl,X) 


a- 


-ai(/32,/3i) 


7r{X) qi{Xji)q2{Xjij2) 


(1- 


ai(A,/3i) 





(3) 



which is derived by imposing that the acceptance probability must preserve the 
detailed balance separately at each stage. 

This procedure can be applied for an arbitrary number of stages: the result 
is a generalisation of Markov chain Monte Carlo methods that is called Delayed 
Rejection (DR). The generic i~th stage of the chain is as follows: if the state 
Pi-i is proposed and rejected, one can propose a new candidate /?; from the 
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proposal gi(A,/3i, . . . ,/3i) and accept it with probability 



a.,{X,Pi,...,Pi) = lA 



l-ai(A,A-i) 


1 - a2{Pt,Pt-i,l3t-2) 




1 - a^-i{Pi, ...Pi) 




l-ai(A,/3i) 


l-a2{Xjij2) 




1 - ai-i(A, . . 





Notice that with the notation we are using, ^^(A, /3i, . . . , /3i_i, represents 
the proposal probabihty of (3i given {X, (Ji, . . . , Pi-i} where the order of the 
parameters does matter. Sometimes, in the literature this also may be written 

as q^{f3^\X, I3i, . . . 

Due to the way in which the DR algorithm has been designed, i.e. by 
imposing detailed balance at each stage and deriving the acceptance probability 
that preserves it, the resulting chain is a reversible Markov chain with invariant 
distribution tt. Thus, the average along a sample path of the chain of a function 
/, is an asymptotically unbiased estimate of / /(a;)7r(a:)da:. 

The benefit of this method is that it increases the efficiency of sampling by 
delaying a possible rejection, see Sec. 12.21 furthermore, it allows us to propose 
a new candidate that can be drawn using the information of the past elements. 
These features make the Delayed Rejection algorithm very useful in a number 
of applications; in particular we shall use it to increase the efficiency in com- 
puting marginalised posterior distributions generated by integrating multimodal 
likelihood functions. 

An essential point of the DR scheme is that Eq. ([4]) has been derived by 
imposing that the backward path, from /3i to A, follows the forward path, from 
X to Pi, in reverse order. As a consequence, this property forces one to choose 
proposal distributions that preserve, in some way, the reversibility of the chain. 
As we shall show, this choice is complicated in any practical application; despite 
this fact, at each stage we will have the freedom to propose a new stage with a 
different proposal with respect to the previous one and to make use of all the 
past information. In the next section. Sec. [3l we discuss the details of this key 
point and provide a general method to actually obtain a Markov chain with 
Delayed Rejection, regardless of the specific target distribution. 

2.2. Efficiency increase with Delayed Rejection 

When a new state is proposed and rejected, instead of retaining the previ- 
ous state and duplicating an element of the Markov chain, the DR algorithm 
allows us to make a new proposal and consider it for acceptance. Naturally, this 
increases the transition probability of the elements of the chain within the differ- 
ent parts of the parameter space, improving the mixing of t he chain and l eading 
to a reduction of the asymptotic variance of any estimate (|Peskunl . [T973h . 

A more quantitative statement about the benefits of using the Delayed Rejec- 
tion algorithm when a transition with low acceptance probability is attempted. 



^(A) gi(A,/3i) g2(A,/?i,/32) ... <z.(A, . . . , A) 



(4) 
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Chain A (noDR): ©(Dd)®®®®®®®®®®®®®®--- 

Chain B (with DR): ffi_®_®_®_®___®___®_®_®J^^ ' 
Pj Pj p, P| P| I p^ Pj Pj Pj P| P| I p^ P| 

Figure 2: Example of the Markov chains that would be generated in the theoretical case 
considered in Sec. 12.21 every Mi elements drawn from a certain proposal, Pi, M2 transitions 
from a second proposal, P2, are attempted but accepted with lower probability. The Delayed 
Rejection algorithm attempts the same number (M2) of transitions drawn from P2 as the 
standard MCMC, but it only stores the one that is accepted. Different circled numbers 
represent possible different elements of the chain, whereas the repeated ones explicitly denote 
the rejected transitions drawn from P2. In this example we are considering Mi = 5 and 
M2 = 3. 



can be made by computing the var iance of an estimate, A, using the autocorrela- 
tion function of the Markov chain (|Sokaill98fll :l iGilks et aLl . ll995l : lGelman et al. 



1996I) . 



var(A) ^ llllE^ , (5) 

where N is the number of elements of the Markov chain, C\\{t) is the autocor- 
relation function of the parameter A at lag t and 

n— 1 ^ ^ n— 1 

From a theoretical point of view, we can demonstrate that the variance of an 
estimate made from a chain using Delayed Rejection is always smaller than that 
produced with a standard Markov chain. 

Let us consider a Markov chain generated from two different proposals, Pi 
and P2; one of them, say P2, has a lower acceptance probability. To establish 
notation, we assume that after Mi elements drawn from Pi, M2 transitions 
from P2 are attempted and all of them are rejected except for the last one (see 
Fig. [2]). This will generate a chain (A) with M2 completely correlated (repeated) 
elements every M1+M2 iterations, whose efficiency could be improved by group- 
ing all these M2 elements into a single one using the DR algorithm (chain B in 
Fig. [2]). Removing repeated elements means reducing the autocorrelation of the 
chain. Tint, or equivalently increasing the transition probability and, therefore, 
getting lower asymptotic variances of any estimate (.Pesku n. 1973;) ; but it also 
means reducing the number of elements, N, of the chain for the same computa- 
tional cost. So, qualitatively it is not clear which effect will dominate in Eq. 
but one can explicitly derive the expression that relates the variances of the 
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same estimate using chains A and B. With our notation, this is: 



varA(A) - varB(A) 
var_B(A) 



/ M2-I 
V Ml + M2 



(Ml + 1) 



Af/(Mi + l) 

h+ E PAA {{Ml + 1)^) 
n=l 



M2 - 1 

Ml + M2 



{Ml + l)coth 



2 + E Paa(") 
tanh 



- 1 



> 



where we have used the definition of the exponential autocorrelation time, 
p\\{n) = exp(— |n|/Tea;p). Equation ([7]) demonstrates that, for a fixed com- 
putational time, we always obtain a benefit from using the Delayed Rejection 
algorithm in terms of reducing the variance of any estimate; furthermore, this 
equation provides a quantitative result for this statement. 



- 1 



(7) 



3. Implementing Delayed Rejection schemes: Key rules 



The Delayed Rejection method presented in a formal way in the previous 
section, guarantees the stationarity and Markovian property of the resulting 
chain; concurrently, it provides ample freedom (i) to choose different proposal 
distributions at each stage of the DR algorithm and (ii) to use all the information 
from previous values of the chain. What is not guaranteed is the ability to 
achieve a non-negligible acceptance probability at some stage of the DR chain. 
In fact, we shall see that in the general case without following some key rules 
about the choice of proposals, the acceptance probability always tends to zero, 
which in turn would defeat the very reason to consider a Delayed Rejection 
algorithm. 

At the i-th state of the DR chain, Pi is generated from qi{X, /3i, . . . , (3i) and 
accepted with an acceptance probability given by Equation This expression 
is the result of the product of three different kinds of terms: 

• The likelihood ratio between the new proposed element of the chain, Pi, 
and the first one. A: 



7r{X) 

• The ratio of proposal probabilities 

qi{Pi,Pi^i) q2{Pi,Pt-i,Pi^2) 



qi{f3i,Pi- 



.A) 



qi{X,Pl) q2{X,Pl,P2) ... g,(A,/3i,...,A) 
The ratio of complementary acceptance probabilities. 



(8) 



(9) 



l-ai(A,A-i) 


1 - a2{Pi, Pi-i, Pi^2) 




1 - Q!i_l(/3i, ...Pi) 




l-ai(A,/3i) 


1 - a2{X,Pi,P2) 




1 - Q!,_i(A, . . . A-i) 





(10) 
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In Section 3] we will show that the latter term is (in most of the cases) of order 
unity, and we will discuss ho w to compute it e f ficiently. Followin K the idea 
of a Metropolis-Hastings ratio ( Gilks et ai . 19951 : iGamerman . we would 



like the acceptance probability ^ to be dominated by the likelihood ratio, 
Equation ([5|), and therefore to have typical values for the proposals ratio (O of 
order unity; this will only be valid if the reversibility of the DR chain, drawn 
from the different proposal distributions, is preserved. Having a handle on 
the ratio of proposal probabilities is therefore essential for a Delayed Rejection 
scheme, and we will now concentrate on this specific point. 

Let us focus now on the ratio of proposal probabilities. Equation In 
the denominator, we are evaluating proposal probabilities of the normal chain, 
in the sense that we are computing the probability of proposing (3j (j < i) 
given {A, Pi, ... , (3j-i} when Pj was indeed generated from qj{X, Pi, ... , Pj). On 
the other hand, the proposal probabilities at the numerator correspond to the 
reverse chain, i.e. we are computing what would be the probability of proposing 
Pj (j < i) assuming that the old elements of the chain are {Pi,Pi-i, . . . ,/3j+i}, 
when in reality Pj was generated from qj{X,Pi, . . . ,Pj). For this reason it can 
be shown (see Appendix [A| that on average, the ratio of proposal probabilities 
([9]) is smaller than one; how small with respect to 7r(/3i)/7r(A) is at the very 
heart of the problem and it will dramatically affect the efficiency of the MCMC. 
In turn, the actual value of the term ((9]) depends on the reversibility of the DR 
chain. 

The main application we have in mind for the Delayed Rejection algorithm 
is a Markovian way to sample target distributions that are characterised by well 
separated local maxima in parameter space. In order to achieve this goal we 
proceed as follows: at a certain stage of a standard Metropolis-Hastings MCMC 
chain - likely exploring a secondary mode of the distribution - we start a DR 
chain with a big jump that attempts to reach the mode of the distribution; such 
transition, likely to be rejected but hopefully closer to the mode, is followed by 
many small ones in order to explore the new region of the parameter space (see 
Figure [3] for a cartoon exemplifying the situation): 



U = l) qi{X,Pi) = qa{X,Pi) 



{j > 2) q,{X, Pi,...,P,) = qb(x[X, Pj-i\,P,) 



mainly proposing big jumps 
(travel to another local maximum) 



mainly proposing small jumps 
(explore the new region) 



Thus, it is inevitable that the resulting chain will have some degree of non- 
reversibility. However, this behaviour can be controlled in order to yield a 
non- negligible acceptance probability from Eq. We will devote Section [57T] 
to address this essential point. 

From here on and for the sake of simplicity, we will assume that during the 
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Figure 3: Graphical representation of the idea we want to implement to efficiently explore 
different maxima of the target distribution using the Delayed Rejection algorithm: an initial 
big jump (proposed by qa) which may not land in the peak of the neighbourring node, followed 
by multiple small steps (drawn from gt,) in order to explore the region we landed. In this 
paper we show that this exploration cannot be "up-hill" , as in a standard Metropolis-Hastings 
method, but blind in order to achieve reasonable acceptance probabilities. 

Delayed Rejection algorithm, only one parameter is being updated. All the con- 
clusions and results contained in the following sections can be straightforwardly 
extended to the case of updating several independent parameters; furthermore 
all the guiding principles (but not the numerical results) would be unaffected 
for the general case of updating correlated parameters. 

3.1. Choice of proposals: Functional form 

The first and last terms of the product of proposal probabilities, 



are the ones which involve either the initial 'big jump' proposal or the elements 
of the chain generated from it. In the numerator of the first fraction, we are 
evaluating a 'big jump' proposal, qa, but with two elements of the chain (mostly 
probably) separated by a small distance in the parameter space. The opposite 
case happens in the numerator of the other fraction, where two elements of the 
chain with a likely big separation in the parameter space are evaluated with an 
'small jump' proposal, qb. 

Having a qa proposal only contemplating big jumps and qb small jumps, 
would result in a tiny value for the ratio of proposal probabilities, Eq. ([9]), very 
difficult to compensate with the likelihood ratio in order to finally get non- 
negligible values for the acceptance probability in Equation The solution to 
this problem comes through including a certain probability of doing small jumps 
in qa and big jump proposals in qb and therefore, all our proposals become 
a mixture of three Gaussian distributions (3-Gaussian, in future references), 
symmetrical around the central mode. 

In Figure[3]we graphically represent the proposal functions that we are going 



qa{Pi,P^-l) 
9a(A,/3l) 



and 



qb{x[X,. . . ,(3i-i],(3i) 



qb{x[(3t, . . . A) 



(11) 
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X - X X - X 



Figure 4: Proposal distributions used in the general Delayed Rejection algorithm in order 
to efficiently explore different maxima of the target distribution. We force the 3-Gaussian 
functions to be symmetrical and therefore, they can be characterised by 5 parameters, 
{cri,cr2, /J,, Na, Ni,}, graphically represented in this figure. For each particular problem, they 
will be chosen to be adapted to the structure of the target distribution: /i being the typical 
distance between different maxima and ai their typical width. The optimal values for Na and 
Ni, are found as a compromise between efficiently exploring the parameter space and having 
non-negligible acceptance probabilities, see Sec. [3] for further discussions. 



to use, that can be parametrised as: 



ga,b(a;,a;) = 



1 - Na,b 



271(72 



exp 



{x + fi — xY 



2aj 



1 



1 - Na,b 1 



exp 



(x — xY 



2a\ 



27r(T2 



exp 



(x — [i — xY 



2ol 



(12) 



where x may be a function of the old elements of the DR chain and Na.h rep- 
resents the probability of proposing a value from the central Gaussian. The 
extreme case where qa is always proposing big jumps and qb small ones, would 
be given by setting = and iVf, = 1, but a compromise must be reached 
between having an efficient way to explore multimodal target distributions and 
obtaining non-negligible acceptance probabilities. 

In general, one can compute the expected values of the terms in Eq. Ijlip 
for the whole parameter space {cti, (T2, /i, A^a, -^&}, trying to quantify the con- 
tribution of these ratios of proposal probabilities to the final acceptance prob- 
ability. In Figure [5] we plot the results obtained numerically, which are valid 
in the whole parameter space. An analytical expression can also be derived for 
the cases where the separation between the Gaussians, /i, is big enough (i.e. 
H » (Ji): 



E 



log 



qa{Pi,Pi-i) qb{x[f3^,...,(3i],X) 



9a(A,/3i) qb{x[X, . . . , (3i^i], (3i) 



(Na - Nb) \o. 




(13) 

In this case, the losses in acceptance probability due to an asymmetrical proposal 
(AP) become independent of {f7i,cr2,/i} as we can see in Figure [SJ where the 
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Na — Nh maps tend to a common plot as we go to smaller values oi ai/ fi (low- 
right plots of the grid). 

The reason why the analytical expression is only valid in the limit of small 
values of ai/ /i is due to the assumptions made for its derivation (see Appendix [Xl 
for further details). In Figure [S] we represent the r.m.s. differences of the 
analytical solution compared with the numerical one in a ai//i — 1T2/M map, 
where we can notice that the analytical expression will be applicable when 
ci/fj, ~ o'2//i < 0.1, i.e. when the separation between Gaussians is > 10 times 
bigger than their typical width. 

In Figure O as well as in Eq. , we realise that choosing the solution that 
the intuition may suggest, with qa always drawing big jumps (i.e. Na = 0) and qb 
small ones (iVb = 1) would yield an infinite losqj in the acceptance probability. 
However, from these plots one can see that we can still have a very efficient 
algorithm by choosing small (big) values of Na (Nb) and getting proposal ratios 
of the order of e^^ — e^^, which is acceptable. 

3.2. Choice of proposals: Central location 

The other non-trivial aspect that one should take into account in order 
to avoid ending up with negligible acceptance probabilities in the DR algo- 
rithm is the way that the central location of the mixture of Gaussians proposal, 
x[X, . . . , (3i], is chosen based on the preceding sample path. Any of the generic 
ratios of proposal probabilities that appear in Eq. (j4l), 

q2{(3i,(3i-i,(3t-2) ... gi^i(/3i, . . . ,/3i) ^ qb{x[Pi, (3^-l], (3^^2) ■■■ qb{x[Pi, . . . , (32], (3i) 

92(A,/3i,/32) ... gi_i(A, . . . ,/3i_i) qb{x[X, Pi], P2) ••■ g6(x[A, . . . , /3i_2], A-i) 

(14) 

contains the proposal function evaluated over the reverse chain in the numerator 
and evaluated over the normal chain in the denominator. The latter term has 
no issues, because the chain is evaluated with the same probability density 
function from which it was drawn, but this is not the case for the term in the 
numerator, where the f3j {1 < j < i — 1) element of the chain was generated 
from a 3-Gaussian function centred at x[X,Pi, . . . = Xf but is evaluated 

by a function centred at x[l3i, . . . , Pj+i] = Xr- 

Any difference between these two quantities, A = Xf — Xr, means having a 
numerator (on average) smaller than the denominator. In particular, in the limit 
of big separation between Gaussians {ci/ fi small), one can derive an analytical 
expression (see Appendix E| for the expected value of a single proposal ratio 
due to a shift. A, between the value where the proposal function was centred 
and the central location of the evaluation function: 



^In the case of Gaussians with a large separation this is strictly true; if the separation is 
not that big, then the losses are merely huge. 



E 



log 



qb{x[(3,,...,f]j+i],Pj) 
qb(x[X, . . . ,/3,-i],(3j) 
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Figure 5: Expected value of the logarithm of proposal functions affected by the fact that 
we are using an asymmetrical proposal (see Eq. jlTJl) - first a big jump in the parameter 
space and then a series of small ones to explore the new region - to efficiently move between 
different modes of the target distribution. These results were obtained numerically and they 
depend on 4 independent parameters {ai / fi, (T2/ fi, Na, Ni,} (the panels without any contour 
lines are because they represent only values greater than —1). Since the expected values of the 
proposals are always smaller than one, we'll refer to them as the losses (in the final acceptance 
probability) due to an asymmetrical proposal (AP). 
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Figure 6: Root-mean-squared values of the differences (in a Na — A^b map, like the ones of 
Figure [Sjl between the numerical results of Figure [5] and the analytical expression given by 
Eq. I I13I I. that was obtained under the assumption that the three Gaussians are well separated. 
We plot the results as a function of ui/fi and (72 //i, getting a quantitative result to evaluate 
the validity range of Eq. I I13I I. 



We see that the contribution of each pair of proposals, wiU always be smaller 
than 1 for any A, so without a proper way of computing x, this effect will appear 
in all the terms inducing a systematic shift of the acceptance probability to very 
small (negligible) values. Therefore, it is crucial to find an algorithm in which the 
central location of the proposal is independent from the particular sample we are 
using to compute x, or in other words a;[A, . . . , Pj-i] — x[Pi, . . . , Vj € 

This fact automatically rules out some logical options that one might have in 
mind as an efficient way to explore the new region of the parameter space, such 
as (i) using, with a certain probability, the past element of the DR chain with 
the highest likelihood to draw the next elements, trying to emulate a Metropolis- 
Hastings algorithm, or (ii) always proposing from the first point in parameter 
space in which one lands after the 'big jump'. On the other hand, we have the 
opposite case that would be using the last element of the chain to draw the next 
one, but this would represent a complete blind random walk, and therefore a 
very inefficient way to explore the parameter space given some prior knowledge 
of the target distribution. Thus, the exploration of the new region, depicted 
in Fig. [3l has to be performed blindly rather than "up-hill" , as would be in a 
standard Metropolis-Hastings algorithm. The real benefi ts of the DR co me from 
increasing the efficiency of the MCMC algorithm in the iPeskunI (|l973f) sense. 

There are several ways to build an algorithm that at the same time explores 
efficiently the parameter space and yields A ~ 0. The solution we propose here 
consists in using the mean of all the past elements of the chain after the big 
jumjj^ as a central location for the next proposal, 



.A 



mean(/3i, . . . ,/3j_i) 



(16) 



We use all the elements of the DR chain except the first one. 
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although, of course, other options may be considered. The main motivation 
for our decision is that we are using a statistical property of the chain, which 
has the same expected value independently of the particular sample of elements 
we are considering. Moreover, by using the mean value, the generated central 
proposal values will stabilise near the first point where one lands after the big 
jump, which is also convenient in order to explore this new region. 

Once we have made this choice, one can numerically compute the expected 
values of the proposals ratio of Equation (i.e. losses in the final acceptance 
probability due to the central proposal evolution (CPE)) in the whole parameter 
space, here defined by {cti, cr2, /j,, Nb, Uor}, where is the number of elements 
of the DR chain. In Figure [71 we plot these results in the same way as in 
Sec. 13. 1[ i.e. a grid in the ai/iJ, — a2/fJ, space of maps in the Ni, — nj,^ plane. On 
this occasion it was not possible to get an analytical expression as we justify in 
Appendix [A) 

By looking at Figure [7] one can see that the losses due to CPE are very 
sensitive to the actual values oiai/ fx. When the Gaussians have some separation 
(i.e. ail becomes small), we need very large values of in order not to obtain 
negligible contributions to the acceptance probability. The dependency on the 
number of stages that we consider in a Delayed Rejection, tidr, is not that 
important: at the beginning, tidr < 100, the losses increase as we add more 
steps to the DR chain, but after that first period the values stabilise ending up 
almost constant; in fact. Figure [7] is plotted only up to n^^ = 500, although we 
checked that the contour lines are almost constant for tidr > 500. 



3.3. Delayed Rejection in practice: Choice of parameters 

In the previous sections we have discussed all the possible problems that one 
can find trying to use the Delayed Rejection algorithm as an efhcient method 
to explore the local maxima of the target distribution, and their solutions. The 
purpose of this subsection is to give the general scheme of how to proceed in 
a real problem and how to make the selection of parameters in light of our 
results, in order to get an efficient algorithm with non-negligible acceptance 
probabilities. 

We use the general DR algorithm introduced bv iTiernev and Mira ( 1999^ : 
Miral ( 2001 ). choosing as proposal probability functions two kinds of 3-Gaussian 



distributions (one qa to propose an initial big jump in the parameter space trying 
to land in the neighbourhood of a new maximum, and an arbitrary number of 
to explore this new region where we landed) parametrised according to Eq. (I12p 
and graphically represented in Figure ID This gives a set of six parameters, 
{cti, (72, /i, iVa, A'b, riDR} that we are free to tackle at best the problem at hand. 
The way to proceed should be the following: 



^Of course, given a sample of A'^ elements of a certain random variable, their mean will be 
closer to the expected value as A'^ increases, so we expect to have bigger A's with the first 
elements of the chain. 
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Figure 7: Expected value of the logarithm of proposal functions from Eq. I I14I I obtained 
numerically for the whole parameter space, {rri/ fi, a2/ fi, Nij^tidr}, using the same kind of 
representation as in Figure \5\ These results are obtained after having made the decision of 
computing x[A, . . . , f3i] as the mean value of all the elements of chain after the big jump (see 
Eq. Ijl6|l ) . The reason why the expected values are always smaller than 1 is because the central 
location of the proposals evolves; that's why we will refer to these results as the losses due to 
the central proposal evolution (CPE). 
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1. The first three parameters, {cri, (T2, m}, are chosen to have a proposal 
adapted to the multimodal structure of the target distribution, /i being 
the typical distance between two maxima of the target distribution and 
Gi their typical width. So a rough previous knowledge of the target is 
required. 

2. Once we know the first three parameters, one can go to the plots of Fig- 
ures [5] and [7] and look at the maps corresponding to the closest a\l [i — aij \x 
values they have. In particular, one first can look at losses due to the cen- 
tral proposal evolution ( CPE) (Figure[7]) to select an Njj value high enoug ifl 
to have acceptable expected values of this ratio of proposals. Normally 
one works with rinR > 100, where the results are almost independent of 
this parameter. 

3. After knowing the appropriate Nf, value to use, one has to look at the losses 
due to an asymmetrical proposal (AP) (Figure[5]) in order to get the value 
of Na- Although in principle, Na should be as small as possible in order 
to mostly make big jumps at the first step, sometime^ it is necessary to 
choose larger values in order to reduce the losses in AP. Having a large 
Na value means that not all the DR chains will start with a big jump 
in the parameter space; this is however not a problem because although 
sometimes the DR chain will initially explore the same maximum in which 
the chain is, this is desirable once the global maximum is reached. 

4. Since the dependency with 71dr is very weak, the criterion to choose this 
last parameter will be mainly driven by the constraints on the computa- 
tional costs. 



4. Numerical Implementation 



The Delayed Rejection algorithm has been applied to many problems (lAl-Awadhi et al. j 

2004; U mstatter eV adl2004HRaggil . l2005HHarkness and GreenlT2000l : [Robert and Mengersen . 
2003tiHaario et aLl . l2006i r but this is the first time to our knowledge that it has 



been implemented in its general scheme with an arbitrary number of stages. 
In this section we focus on practical numerical implementation issues that one 
faces in applying a general DR scheme (with no simplifications) with tidr steps. 

4-1- Efficient computation of the transition probabilities 

The general expression for the acceptance probability of the N-th st age 



(n < n-Q-i,) of the DR, see Eq. (|4]), can be rewritten following iMiral (|200l[) as 



^The tolerance range depends on the typical ratio between the likelihood of the main 
maximum and the likelihood of the secondary ones. In our experience, allowing losses of the 
order of — is acceptable. 

^Mostly when we are facing a problem with highly separated maxima, that requires high 
values of Nf, in order to avoid huge losses due to CPE. 
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afi{X, Pi, . . . , (3^) = 1 A Since the denominator contains the terms evaluat- 
ing the chain in the natural order, its value after adding a new stage can be 
computed recursively, reusing many of information from the previous stage, 

Dn = -Dk-1 gN(A, /?!, . . . ,/3n) l-aN-i(A,.../3N-i) ■ (17) 

On the other hand, all the terms of the numerator must be computed at 
each step, 

Nti = Tr{P^)qi{P^,(3t^^i)...qfi{f3^,...,X) 1 - ai(/3N, /3n-i) •■■ 1 - aN-i(/3N, ■ • ■ , /3i) 

(18) 

Furthermore, since the N-th acceptance probability involves N — 1 a's, unless 
we reuse the results of previous iterations of the DR chain, we would have to 
compute 2^^"^ additional acceptance probabilities for the numerator of the N-th 
stage. 

Fortunately, most of the terms mentioned above are repeated or computed in 
previous stages of the DR chain, and only N new terms have to be computed at 
the N-th stage. Here we describe a way to efficiently implement the algorithm so 
as to avoid unnecessary computations, resulting in the maximum efficiency. Our 
procedure is based on a property of the N-th term of the acceptance probability 
that relates the result from evaluating the chain in the forward and reverse 
orders, 

Qn(A, . . . ,/3n) ^ ^j^g^ 

Q;n(/3n, . . . , A) -Dn 

From the results of Eq. (fTT]) and (fT9|) . one can build a table like the one 
shown in Figure [5] a row at a time, starting each row from the right hand side. 
By making use of Equation (|17p . the denominator of any element of the table 
can be computed using the information from the element just one position up 
(A); and the numerator will involve all the elements at its right hand side with 
the parameters in the reverse order (B); here is where we make use of Eq. (|19p . 
Of course, only the elements of the leftmost column will correspond to the real 
acceptance probabilities of a certain stage. 

One extra benefit from always working with chains in the 'normal' order, 
and just using Eq. (fT9|) to compute the chains with 'reverse' order is that we 
can easily reuse proposal probability values computed at previous elements. 
This already was the case for the denominator as we can see in Eq. dni), which 
contains only a single new proposal probability to compute. For the numerators 
it turns out that the proposal probabilities are repeated as we move left in the 
table of Figure [51 so only a single value needs to be computed for each new 
element of the table. 

This procedure is completely general and independent of the proposal dis- 
tributions that are used. In terms of computational costs, we can see from 
Figure [8] that building a DR chain with N stages will involve the computation 
of « ^(n^ -I- n) terms, whereas a normal chain would have involved ~ N. Of 
course, in the cases that one can simplify the problem working with proposal 
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Figure 8: Example of the table that need to be built in order to efficiently implement a 
general Delayed Rejection algorithm with an arbitrary number of stages, riDH . Only the terms 
in the first column represent the actual acceptance probabilities of the DR chain, but all the 
others are required in order to compute those first ones. The table must be constructed by 
rows, starting each row with the element at the rightmost end. For an arbitrary element 
of the table, its denominator can be computed reusing the calculations of the element just 
above (see Eq. I I17I I) (A), and for the numerator we will use the previous calculations done 
for all the elements at its right hand side (see Eq. il8l together with Eq. I I19I I) (A). This is 
the most efficient way to proceed in the implementation of a general DR algorithm, reducing 
the number of new quantities to be computed at the new N-th stage from 2'"^^ (in the most 
inefficient scenario) to N. 
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functions that do not depend on all the past elements of the chain, then the 
general scheme can be simplified further, giving a faster algorithm as can be 
seen in all the previous applications of the Delayed Rejection method. 

4-2. Avoiding numerical divergences 

The general expression for the acceptance probability at the N-th stage of 
the Delayed Rejection, Eq. ([5]), contains 2(n — 1) factors that correspond to 
(1 — ays. Since = 1 it is possible that some of these terms could 

be zercH. In a real implementation, this fact must be taken into account when 
calculating the acceptance probability in the Delayed Rejection chain. In this 
section we will discuss this apparent problem, and show that a finite acceptance 
probability is always found. 

In addition to Eq. (jl9p . another conclusion that can be drawn from com- 
paring the acceptance probabilities of the forward and reverse versions of one 
chain, is that if one of them is not equal to 1, the other must be 1. Taking this 
into account and looking at the (1 — a) terms that appear in Eq. one can 
see that: 

# zeros in = # aj^i{Pi, . . . , ^j) = 1 , i < j <N, 



# zeros in Nt^ — # 



a^-jiPj, . . . , 7^ 1 , i<j<N. (20) 



In other words and in terms of the procedure explained in Figure [SI this means 
that the number of zeros in the denominator of a certain acceptance probability 
will be equal to the number of as above it in the diagram that are equal to one; 
whereas the number of zeros in the numerator will be equal to the number of 
as on its right hand side not equal to unity. 

Making use of Equations (|^D|) , and focusing on the actual acceptance prob- 
abilities in the leftmost column of Figure [51 we can see that: 

• The number of zeros in denominator will be zero, because if it were not 
the case, then a previous stage of the Delayed Rejection would have been 
accepted. 

• Therefore, in order to have a non-zero acceptance probability, all the as 
to the right at the same row of the diagram of Figure [5| must be one. 

To show that the second condition is indeed possible, we will take as example 
the case of the third stage acceptance probabihty, a3{X, Pi, P2, Ps)- We have 
already estabfished that the ai(A, /3i) and a2(A, Pi, (32) above cannot be one, so 
we shall show that the entries to the right can be one, starting with cti{P2T Ps)- 



°In this case, the asymptotic behavior of the numerator and denominator is the same, so 
their ratio would be finite. 
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• As ai(32, Ps) = "'^^^1'^^^^^'^^^ and the q functions are symmetrical in the 

order of their arguments, we only need 7r(/33) > it {(32) in order to have an 
acceptance probability equal to one. 

• Next, the term Q!2(/3i, /32, Pd.) can be equal to one, in either of cases 

— Any of the terms above it, in this case just ai(/9i,/32), is one, which 
would make the denominator equal to zero. In this case we take the 
minimum of the ratio and 1, which is of course 1. 

— Otherwise, the ratio is finite, and can be > 1. 

Therefore, we see in this example that both of the terms to the right can 
be one, and therefore that a3(A, /32, /Js) can be non-zero. From Figure [H] it 
is easy to see that the same argument applies at any stage of the DR. So we 
see that despite the possibility of zeros existing in the intermediate terms in 
Figure [51 the final acceptance probability at all stages of the Delayed Rejection 
chain is finite, as we desire. 



5. An application: Analysis of gravitational-wave data 

In this section we present results obtained by applying the Delayed Rejec- 
tion algorithm described above to a specific problem: the search for gravitational 
waves emitted by short-period close compact binary systems in th e data of the 
Laser Interferometer Space Antenna (LISA) ( Bender et al. . 1998[ ). LISA is a 



space based laser interferometer to survey the sky in the gravitational-wave ob- 
servational window and is expected to observe a variety of sources, especially 
binary systems: galactic close stellar-mass compact objects; high-redshift mas- 
sive black hole binaries; and also stellar mass black holes or degenerate stars 
orbiting a massive black in the centre of galactic nuclei. The mission is cur- 
rently being developed by ESA and NASA with an expected launch date in 
the time frame 2018-I-. In essence, the LISA data set consists of three time 
series that contain full information about all the sources, as the instrument is 
an all-sky monitor and the expected sources are, for the overwhelming majority, 
long-lived compared to the mission lifetime; so the signals are always "on" and 
overlap with each others. Most of the signals will be buried in the noise, and 
optimal filtering needs to be applied to positively identify them. The analysis 
of the LISA data provides an interesting challenge to disentangle an unknown 
number (of the order of ~ 10*) of signals overlapping in time and frequency, 
and superimposed to the instrumental noise and foreground radiation produced 
by the brightness of the gravitational wave sky in this frequency region. Every 
individual signal is described by « 10 parameters, and therefore one looks for 
fitting ~ 10^ parameters out of k, 10*" data points. 

Bayesian methods have proven to be particularly promising in tackling these 
issues, and a vigorous effort is currently on-going to develop these techniques 
for essentially the whole range of signals that one expects that LISA will be 
able to observe. Due to the specific nature of the signals and the instrument 
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motion during the observation time, the target distribution shows many sec- 
ondary maxima and makes the problem of efficiently and robustly comput- 
ing the (marginalised) posterior density function particularly hard. A consid- 
erable variety of Monte Carlo integration methods based on the Metropolis- 
Hastings acceptance criterion have been developed, and most of them rely 
crucially on ad-hoc strat egies to probe multimodal ta rget functions, for stel- 
lar ma ss binary systems (ICrowder and Cornish . 2007 ). massive blac k hole bi- 
narie s ( Cornis h and Porteii 2007 ) and ex treme-mass ratio inspirals ( Cornish , 
20081: iGair et al. 20081: iBabak et a^.l ■ l2009^ : although recently, parallel temper- 
ing techniques have shown a good perfor mance sampling the target distribution 
from a single stellar mass binary system ( Littenberg and Cornish . 2009t ). This 
is an ideal arena in which a Delayed Rejection scheme would provide significant 
advantages to MCMC analysis algorithms. 

In short, the test analysis that we consider here can be summarised as follows. 
We have a data set 

d{t) = n{t;Cn) + h{t-Q (21) 

which consists of the superposition of noise n[t] C,n) and gravitational-wave signal 
/i(t;Cs), described by a vector of unknown parameters Cn and Cs, respectively. 
We will indicate with C, — {Cn, Cs} the vector of all the M unknown parameters 
that describe the problem. By applying the Bayes' theorem, we can compute 
the joint posterior density function (PDF), p{C,\d), of C, given the data sets, d. 



p(CM) 



v{C)C{d\o 
p{d) 



(22) 



where C{d\C,) is the likelihood function and p(C) the prior probability density 
function; p{d) is the marginal likelihood or evidence that for the problem at 
hand plays simply the role of a normalisation constant. Given a sub-set of 
parameters, say Ci7---iCmj one wants to compute the marginalised posterior 
density function: 



p(Ci 



, CmM) = j dCm-(-l ■ ■ ■ j 



M ■ 



p{d) 



(23) 



The case that we consider here is the simplest one for the LISA analysis problem: 
the computation of the posterior density function of the parameters that char- 
acterise a single stellar mass compact binary system in Gaussian and stationary 
data of unknown spectral density. However, this specific problem provides all 
the conceptual issues that we will be presented in a real analysis and more com- 
plex analysis tasks, such as the search and computation of the posterior density 
function for (spinning) black hole binary systems and extreme-mass ratio inspi- 
rals. 

Let us consider a gravitational wave signal produced by a short-period 
solar-mass compact binary system in our galaxy that is sufficiently far from 
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coalescence that its frequency does not change over the observational tim^. 
For a distant observer at rest (or moving with constant velocity with respect 
to the source), such signal would simply be a sinusoid, and the analysis of 
the data completely trivial. However, stellar-mass compact objects radiat- 
ing in the LISA frequency window are long-lived, and during the observation 
time Tobs of several years, the instrument changes position and orientation, 
which in turns produces phase and amplitude modulations in the radiation 
recorded at the instrument output. Such modulations depend on the source 
frequency and sky location. The power of the signal is spread over several 
frequency bins: LISA's change of orientation causes a frequency shift of width 
2/Tobs ~ 6.5 X 10~* Hz and the instrument motion around the Sun with velocity 
« 30 km s^^ and period 1 yr produces a Doppler modulation with typical 
width « (v©/c)/o ~ 10^'' (/o/l mHz) Hz. We note that these frequency shifts 
are much smaller than the range over which the instrument noise varies, and 
we will assume that the noise contribution (which is actually unknown) has a 
constant variance af^ . In summary the total number of parameters that describe 
the problem - i.e. the dimensionality of C - is M — 8: 

Cs = {/o,-4,0,6','0,i,(/3o} , 

Cn = (24) 

where /o is the constant frequency of the signal (with respect to the Solar System 
Barycentre), A is the amplitude, the longitude (p and co-latitude 9 are the two 
angles that define the source's sky location, -0 is the polarization angle of the 
gravitational wave; the inclination angle between the angular momentum of the 
system and an unitary vector parallel to the line of sight is defined by i, and ipo 
is a constant that fixes the initial phase of the signal. An example of a typical 
signal in frequency and time domain is shown in Figure[9l and an example of the 
resulting target distribution, that presents a multimodal structure, in particular 
in the frequency space and also in the two angles that define the sky location, 
is shown in Figure [TJ Thus, for this particular problem it is crucial to have an 
MCMC algorithm capable of efficiently exploring the different maxima of the 
target distribution, because otherwise on e can get in t roubl e trying to find the 
right signal inside the data as we show in iTrias et all (|2008l) . 



For the specific example presented here, we consider a mock LISA data set 



taken from the release IB of the Mock LISA Data Challenges (jBabak et al. 



2008[ ) and in particular, the data set MLDC-lB.l.lc that contains a single 



source at frequency around 10 mHz added to zero-mean Gaussian and station- 
ary noise with signal-to-noise ratio of 15. The implementation of the Delayed 
Rejection algorithm used in the analysis is done by following the steps discussed 



'^In reality the frequency derivative, /, is always different from zero; liowever as long as 
f^obs ^' which is satisfied for the majority of close binary systems in the LISA band, the 
change of frequency can be neglected. It is trivial (although it may affects the computation 
time) to account for the change of frequency during the observation time, but for sake of 
simplicity this is neglected here. 
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Figure 9: Representative example of a (mock) LISA data set and the signal that we want 
to identify in the (a) time and (b) frequency domains. This particular signal corresponds to 
the (simulated) LISA'S response to a gravitational wave emitted by a stellar-mass compact 
binary object generating a monochromatic signal at ^ 10 mHz, that is modulated in both 
amplitude and phase, due to the change of position and orientation of the detector during 
the 1 yr of observational time; the resulting signal-to-noise ratio is 15. The top panel of (b) 
represents the Fourier transform of the data and signal sets plotted in (a) , where we can see 
that the power is spread over several frequency bins because of the LISA motion, and the 
bottom plot of (b) corresponds to the marginalised and maximised likelihood function over 
all the parameters, 7, except the frequency of the signal; the dashed line shows the actual 
value of the signal's frequency. The multimodal structure observed in the frequency space is 
also present in the two angles that define the sky location (e.g. see Figure[T] although there, 
another set of parameter values were used), which dramatically reduces the efficiency of the 
MCMC algorithm unless a DR algorithm is implemented; see output in Figure [TOl 
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in Section 13.31 i.e. we start analysing the structure of the target distribution, 
and we derive that the optimal parameters for our proposals in frequency are 
(7i = 0.45 yr~^; <J2 = 0.2 yr"-'^ and /i — 1.25 yr^^. By looking at the appro- 
priate plot of Figure [7] (in our case it would be approximately in between the 
second and third rows of third column) one realises that for the problem at 
hand — 0.95 is an appropriate value, since the expected value of the ratio 
of proposals is ^ e~^. The results are almost independent of Hdr, so the choice 
is driven mainly by computational reasons and we set tidr = 2000. Finally, the 
corresponding plot of Figure [5] gives an idea of the Na value that one has to 
choose, low enough to mainly start the DR with a big jump in the frequency 
space, but not too much in order to avoid a negligible acceptance probability; 
in our case we'll work with Na = 0.15. During the normal run, we enter into 
the DR routine with a probability of 10"'^. 

The correlations between the frequency, /o, and the sky location, {</>, 9}, 
force us also to update these two parameters during the DR chain, in order 
to prevent the efficiency to become very small. However, since the multimodal 
structure of the target distribution as a function of these two parameters is 
smoother, it is sufficient to draw their values from single Gaussian distributions, 
which reduces the irreversibility of the proposals to just the signal frequency, 
i.e. all the results obtained for a single parameter in Sec. [3] are completely valid 
here. 

In Fig. [TO] we compare the performance of the DR MCMC implementation 
just described above (case C in this figure), with two approaches based on 
the application of standard MCMC, one proposing big jumps with probability 
Pb.i = 10~^ (case A) and the other attempting them with higher probability, 
Pbj — 2/3 (case B). Notice that the choice of case A is based on keeping the 
same proportion of elements in the chain generated from small and big jumps in 
the parameter space as in case C, whereas in case B it is the ratio of attempted 
transitions that is kept constant from case C. In order to make a fair comparison, 
it is also important to notice that the computational power required to generate 
N elements with the DR MCMC algorithm (case C) is almost three times higher 
than for the other two case^ (A and B), i.e. given a certain computational time, 
we will be able to generate 3A^ elements of the chains for cases A and B and 
just N elements for case C. For all these three cases, we assume the worst case 
scenario, starting the chains in a secondary maximum of the target distribution. 

By looking at the chains (left panels of Fig. [TO]) one see that either B or 
C are very efficient in exploring different maxima of the likelihood function, 
whereas the chain of case A needs 20 times more iterations to accept such 
transition. On the other hand, when looking at the autocorrelation functions 
(right panels of Fig. [TO]) we notice that the chain generated in case B is much 



"With the DR MCMC approach, on average, every 1000 iterations we will have 999 elements 
drawn from a standard proposal and 1 DR involving rios = 2000 evaluations of the likelihood 
function; so their computational cost (which is dominated by the likelihood evaluation that 
requires a time tHfc.eual.) wiU be 2999 tuk.nval. in comparison to the 1000 tnj^^^^al. of a 
standard MCMC. 
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Figure 10: Markov chains (left panels) and autocorrelation functions after making the tran- 
sition to the main mode of the of the target distribution (right panels) obtained by searching 
for a gravitational wave signal from a single white dwarf binary observed by the Laser In- 
terferometer Space Antenna with three different (Delayed Rejection) Markov chain Monte 
Carlo algorithms: (A) a standard MCMC method attempting big jumps in the parameter 
space with probability Pbi = 10~^, (B) a standard MCMC with pei = 2/3; and (C) the DR 
MCMC algorithm described in this paper, entering into the Delayed Rejection routine with a 
probability of 10""^ and allowing up to ?1dr = 2000 iterations to explore the new region of the 
parameter space. For this particular example, the proposal distributions during the DR are a 
mixture of three Gaussian functions for the signal frequency (parametrised by cri = 0.45 yr~^; 
02 = 0.2 yr-l; fi = 1.25 yr"!; Na = 0.15; Nf, = 0.95 and noR = 2000) and two single Gaus- 
sians for the two sky locations, {(/> , 9}. From top to bottom the plots refer to frequency /, 
co-latitude, 9, and longitude, <j>. 
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more autocorrelated that the other two; in fact, one can get a more quantitative 
measurement of this quantity by computing the Tint from its definition, Eq. ([6]): 
{TA,TB,Tc)intj = (1112,3089,881); {TA,TB,Tc)int,8 = (49.7,141.1,47.2) and 
{TA,TB,Tc)int,<t, = (1128,2813,868). Even taking into account that for the 
same computational time, Nb — 3A^c; the integrated autocorrelation times for 
B, Tint^B, are more than 3 times larger than those of chain C, Tint,c and therefore 
the variances obtained from C will be smaller than those from B. 

In conclusion, we can see that the chain obtained from the DR approach 
(case C) is the preferred one when we combine the two requirements, having 
an algorithm that (i) efficiently explores multiple maxima and (ii) reduces the 
variance of the estimates once it has reached the stationary distribution. 



6. Conclusions and Future Work 

Multimodal target distributions characterise a wide variety of problems in 
many fields, reducing dramatically the efficiency of a Markov chain Monte Carlo 
sampling. Many of these problems have been successfully solved with the intro- 
duction of techniques that promote the movement of the chains by, e.g. running 
parallel chains or making the target distribution temporarily smoother. These 
lines of attack fail however for complicated structures of the target distribution 
in many dimensions, that result in different localised 'islands' where the function 
reaches high values (see Figure [T]) . 

In this paper we have presented a fully Markovian algorithm capable of 
tackling these kind of problems when one has some previous knowledge about the 
target distribution structure, resulting in a efficient MCMC method to explore 
multimodal distributions. Our algorith m is based on the appl ic ation of the 
Delayed Rejection scheme introduced bv lTiernev and Mira (ll999l ): lMiral ()200lh 



with an arbitrary number of steps, whose successful implementation requires to 
tackle a number of non-trivial problems. In particular, we have explored the 
choice of proposal distributions to achieve finite acceptance probabilities (Sec. [3]) 
and we have provided details for the numerical implementation of the algorithm 
(Sec. H]). Finally, we have shown the benefits of this method using an example 
taken from the search for a gravitational wave signal in simulated data of the 
Laser Interferometer Space Antenna (Sec. [5]). 

An additional benefit of the Delayed Rejection algorithm described in this 
article is the possibility of combining the efficient exploration of the target dis- 
tribution with the estimation of Bayes factors and p erforming mo del selection 
(for instance, using the Reversible Jump algorithm ( Green . [l995l )). This is a 



necessary extension of the approach presented here in order to successfully ad- 
dress the problem of searching for, and estimating the parameters of an unknown 
number of gravitational wave signals, which we are currently investigating. 
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A. Analytical expressions 

In Section [31 we have analysed the ratio of proposal distributions that ap- 
pear in Equation ([¥]), providing numerical results for its expected value in all 
the parameter space (Figs. [n]and[7|). In this section we try to do the same 
analytically, getting an expression for the losses due to make an asymmetrical 
proposal (AP), Figure[5l valid when the three Gaussians are well separated (i.e. 
Gil PL small) and understanding why it's so important to always propose values 
from a distribution centred at the same location. We also justify why the other 
results can only be obtained numerically. 

In general, using Bayes' theorem, one can derive an expression for the prob- 
ability distribution of /' = f{x), x being a random variable generated from a 
known probability density function, q{x): 

p (/' = /(^)) = E Jmt ' ^^""'^ = ^' ■ ^^^^ 

i I dx 

The derivation is made rewriting p{f' = f{x)) as the marginalised probability 
distribution of p(/(x), x) over cc, and then writing the joint probability function, 
p{f{x), x), as the product of p{x) times p{f{x)\x), this last term being a delta 
function: 5{f' — f{x)). After expanding the delta function as the sum over all 
Xi such that f{xi) — f one finally gets Eq. (|^ . 

We want to apply the result from Eq. (^5)1 to compute the probability distri- 
bution of every factor that appear in Eq. ([9]), i.e. the ratio of proposals from the 
final acceptance probability. In order to do this, we need to know the probabil- 
ity distribution from where Pi were drawn (3-Gaussian functions, as is discussed 
in Sec. 13. ip and the functions we use to evaluate the elements of the chain (also 
3-Gaussian distributions, with not necessarily the same parameters). Next step 
would be finding all Xi such that f{xi) = /', which for intermediate values of 
/' and having 3-Gaussian functions would provide six solutions. In order to 
simplify the analytical calculations, we make use of the fact that the proposing 
and evaluating functions are always almost centred at the same location and we 
assume that their three modes are well separated (/x >> cr^), which allows to 
only have to evaluate in Eq. ((25)) contributions from a single Gaussian for each 
Xi] i.e. Equation (|25p will be a sum over six terms, where i = {1,2} will only 
involve contributions from the leftmost Gaussian, i = {3, 4} from the central 
mode and i = {5, 6} from the rightmost one. 
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We can distinguish three kind of terms from Equation © : (a) all the terms 
in the denominator, which involve evaluating the elements of the chain with 
the same functions they were generated; (b) a couple of terms in the numerator 
in which the elements are evaluated with a 3-Gaussian function, with different 
relative weight of the modes as the proposal distribution used to generate them; 
and finally, (c) the rest of the terms, that evaluate the chain with a function 
a little bit shifted (i.e. not centred at the same location) with respect to the 
actual proposal. In what follows, we are computing the probability distribution 
of each of these mentioned terms with the aim to combine the results and get 
an analytical expression for the probability distributions of the proposals ratio, 
Eq. ®: 

• All the terms in the denominator, qj{X,f3i, . . . correspond to evalu- 
ate the elements of the chain with the same functions as they were gener- 
ated. Assuming that both 3-Gaussian functions can be parametrised bj|f| 
{(Ti, (72, /i, A^} and making use of the result of Eq. ([^5]) . one gets the ex- 
pected distribution for a single term of the denominator of Equation Q , 
let us call it /', 



E[loi 




• In the numerator, we evaluate the elements in reverse order, which pro- 
duces terms with two kinds of asymmetries 

— In the first and last elements of the product (see Eq. pTjl ). we are 
evaluating a set of random variables generated from a 3-Gaussian 
distribution parametrised by {cti, (72, A'^} with another 3-Gaussian 
function, identical to the first one except for the relative weight of 
each mode, i.e. parametrised by {ai, (72, /i, M}. The expected distri- 
bution of one of these terms that appear in the numerator of Eq. ([9]) 
can also be derived analytically. 



p(log/') = 
E[\ogf'] = - 




^See Equation ||12|I and Figure |4] for the meaning of each parameter. 
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All the other terms, Eq. have the same relative weights between 
the Gaussians that generate and evaluate the random variables, but 
it can happen that the central location of the evaluation function is 
a little shifted with respect to the distribution used to generate the 
chain. Thus, we have that q{x) is parametrised by {cti, tT2, Mi -^l; but 
the elements drawn from it are evaluated with a function f{x) which 
parameters are {ai,a2, ^ + A, TV}. Following the same procedure, 
the expected distribution for any of the terms that appear in the 
numerator of Eq. Q , is 



2(71 e ^e'°«-'"cosh 



p(log/') = 



-21og(V2^^/' 



-21og(y2^^/') 



iaoe ^e'°s/'cosh 



-21og(y2^^/') 



(30) 



E[logf'] 



log{^ 



'271 



N 



2al 



log V27r 



2ct2 



-N 
(31) 

Since the mean of the sum of several random variables is the sum of their 
means, the expected value of the losses in acceptance probability due to an 
asymmetrical proposal (see Sec. 13.11) may be computed as the sum of four 
means values from the previous results of Eqs. ((27)) and ((29)) . The result is 
Equation (jl3p . which, due to the assumptions made here, is only valid when the 
distance between Gaussians, /x, is much bigger than their typical width, cr^. In 
Figure ini it is plotted the validity range of the analytical result, which agrees 
with the numerical results of Figure [5] as <Ji/ ^, is smaller. 

The expected standard deviations of the full ratio of proposal functions, 
can not easily be derived analytically from Eqs. ([26]) - ([31]) because of the 
correlations between variables. By combining the results from Eqs. (|27p and 
([3T|) (see the output result in Equation (fT5|) ) one gets the idea of the crucial 
importance of working with some proposals always centred at almost the same 
location (i.e. A ~ 0), but it's not possible to analytically derive a general 
expression for the total losses due the central proposal evolution (CPE) because 
of two reasons, (a) we don't know the behaviour of A as we add more terms 
to compute x and (b) even knowing the expected value of A for each stage, we 
couldn't straightforwardly use this result since A doesn't appear linearly in the 
expression for the losses due to CPE. 
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